Built with R Studio 3.5.1 on Sun Nov 25 23:27:08 2018
This report explores a dataset that contains the quality rating and 11 numeric variables describing various attributes of 4898 white wines [1]. The guiding question is which chemical properties influence the quality of white wines.
NOTE: The accompanying file, wineQualityInfo.txt, indicates that there are no missing attribute values.
## 'data.frame': 4898 obs. of 13 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ fixed.acidity : num 7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
## $ volatile.acidity : num 0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
## $ citric.acid : num 0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
## $ residual.sugar : num 20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
## $ chlorides : num 0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
## $ free.sulfur.dioxide : num 45 14 30 47 47 30 30 45 14 28 ...
## $ total.sulfur.dioxide: num 170 132 97 186 186 97 136 170 132 129 ...
## $ density : num 1.001 0.994 0.995 0.996 0.996 ...
## $ pH : num 3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
## $ sulphates : num 0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
## $ alcohol : num 8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
## $ quality : int 6 6 6 6 6 6 6 6 6 6 ...
From the structure table above we see that the quality score is defined as an integer variable. By transforming this to a factor variable we will be able to use it in later plots for classification; this new variable is added to the dataset.
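A minimal sketch of that conversion, assuming the data frame is named `df` (as the later output suggests) and that the factor is given the full 0 to 10 rating scale as its levels, which would explain the zero counts in the frequency table. The six scores below are toy values, not rows from the dataset.

```r
# Toy data frame standing in for the wine data (illustrative values only).
df <- data.frame(quality = c(6L, 6L, 5L, 7L, 8L, 4L))

# Keep the integer score and add a factor copy alongside it.
# Assumption: levels span the whole 0-10 scale, so scores that never
# occur still show up in table() with a count of 0.
df$quality.factor <- factor(df$quality, levels = 0:10)

str(df)
table(df$quality.factor)
```

Keeping the original integer column means correlations can still be computed on `quality`, while `quality.factor` is used for grouping in plots.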
##
## 0 1 2 3 4 5 6 7 8 9 10
## 0 0 0 20 163 1457 2198 880 175 5 0
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 3.800 Min. :0.0800 Min. :0.0000 Min. : 0.600
## 1st Qu.: 6.300 1st Qu.:0.2100 1st Qu.:0.2700 1st Qu.: 1.700
## Median : 6.800 Median :0.2600 Median :0.3200 Median : 5.200
## Mean : 6.855 Mean :0.2782 Mean :0.3342 Mean : 6.391
## 3rd Qu.: 7.300 3rd Qu.:0.3200 3rd Qu.:0.3900 3rd Qu.: 9.900
## Max. :14.200 Max. :1.1000 Max. :1.6600 Max. :65.800
##
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. :0.00900 Min. : 2.00 Min. : 9.0
## 1st Qu.:0.03600 1st Qu.: 23.00 1st Qu.:108.0
## Median :0.04300 Median : 34.00 Median :134.0
## Mean :0.04577 Mean : 35.31 Mean :138.4
## 3rd Qu.:0.05000 3rd Qu.: 46.00 3rd Qu.:167.0
## Max. :0.34600 Max. :289.00 Max. :440.0
##
## density pH sulphates alcohol
## Min. :0.9871 Min. :2.720 Min. :0.2200 Min. : 8.00
## 1st Qu.:0.9917 1st Qu.:3.090 1st Qu.:0.4100 1st Qu.: 9.50
## Median :0.9937 Median :3.180 Median :0.4700 Median :10.40
## Mean :0.9940 Mean :3.188 Mean :0.4898 Mean :10.51
## 3rd Qu.:0.9961 3rd Qu.:3.280 3rd Qu.:0.5500 3rd Qu.:11.40
## Max. :1.0390 Max. :3.820 Max. :1.0800 Max. :14.20
##
## quality quality.factor
## Min. :3.000 6 :2198
## 1st Qu.:5.000 5 :1457
## Median :6.000 7 : 880
## Mean :5.878 8 : 175
## 3rd Qu.:6.000 4 : 163
## Max. :9.000 3 : 20
## (Other): 5
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 5.878 6.000 9.000
##
## 6 5 7 8 4 3 9 0 1 2 10
## 0.449 0.297 0.180 0.036 0.033 0.004 0.001 0.000 0.000 0.000 0.000
The output variable, quality, is a score from 0 (very bad) to 10 (very excellent). Its distribution is roughly bell-shaped, with about three quarters of the wines scoring 5 or 6. The sorted proportions for each quality score are shown below the summary.
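The proportions and the mean can be recomputed directly from the frequency table. The sketch below copies the counts from the output above; the variable names are only illustrative.

```r
# Counts per quality score (0 to 10), copied from the frequency table above.
counts <- c(0, 0, 0, 20, 163, 1457, 2198, 880, 175, 5, 0)
names(counts) <- 0:10

# Proportion of wines at each score, largest first, rounded to 3 decimals.
props <- round(sort(counts / sum(counts), decreasing = TRUE), 3)
props

# Mean score recovered from the counts; it should agree with the summary.
mean.quality <- sum(as.numeric(names(counts)) * counts) / sum(counts)
round(mean.quality, 3)

# Combined share of wines scored 5 or 6.
share.5.6 <- unname((counts["5"] + counts["6"]) / sum(counts))
round(share.5.6, 3)
```

The recomputed mean agrees with the 5.878 in the summary, and scores 5 and 6 together cover roughly 75% of the wines.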
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.00 9.50 10.40 10.51 11.40 14.20
The distribution of alcohol in white wines ranges from 8 to 14.2 percent alcohol by volume and is positively skewed. This may be the result of an unbalanced dataset with a large proportion of dry wines that are low in residual sugar (as can be seen in the following plot).
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.600 1.700 5.200 6.391 9.900 65.800
The plot for residual sugar, which includes all data points, shows a positively skewed distribution with a very long tail. Transforming the data to a log10 scale results in a bimodal distribution with one peak at around 1.1 and another at around 9.
A categorical variable for residual sugar (g/dm3) is created as a way to transform the data. Sweetness is defined as follows, from “Sweetness of Wine - Wikipedia” [2]:
| Category | Dry | Medium dry | Medium | Sweet |
|---|---|---|---|---|
| sugar content | up to 4 g/L | up to 12 g/L | up to 45 g/L | more than 45 g/L |
NOTE: dm3 = litre (L)
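One way to build such a category is `cut()` with right-closed intervals, so that "up to 4 g/L" includes exactly 4. This is a sketch rather than the report's actual code; the ten values are the first residual sugar readings from the structure listing.

```r
# First ten residual sugar values (g/dm^3) from the structure listing above.
residual.sugar <- c(20.7, 1.6, 6.9, 8.5, 8.5, 6.9, 7, 20.7, 1.6, 1.5)

# right = TRUE closes each interval on the right: (0,4], (4,12], (12,45], (45,Inf).
# A wine with exactly 4 g/L is therefore Dry, and exactly 12 g/L is Medium-Dry.
sweetness.level <- cut(residual.sugar,
                       breaks = c(0, 4, 12, 45, Inf),
                       labels = c("Dry", "Medium-Dry", "Medium", "Sweet"),
                       right = TRUE)

table(sweetness.level)
round(prop.table(table(sweetness.level)), 3)
```

Because the labels are passed in ascending order, the resulting factor levels keep the Dry to Sweet ordering in tables and plot legends.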
##
## Dry Medium-Dry Medium Sweet
## 2097 1975 825 1
## Dry Medium-Dry Medium Sweet
## 0.428 0.403 0.168 0.000
The summary and proportions of the sweetness levels are shown above, clearly showing that the majority of the wines are either Dry or Medium-dry. The following pie chart shows the relative proportions.
## df$sweetness.level: Dry
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.600 1.300 1.600 1.823 2.200 4.000
## --------------------------------------------------------
## df$sweetness.level: Medium-Dry
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.100 5.800 7.500 7.641 9.300 12.000
## --------------------------------------------------------
## df$sweetness.level: Medium
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 12.05 13.10 14.45 14.94 16.20 31.60
## --------------------------------------------------------
## df$sweetness.level: Sweet
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 65.8 65.8 65.8 65.8 65.8 65.8
And above, the summaries of residual sugar by sweetness level.
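The layout of that output matches what `by()` prints, so the grouped summaries were presumably produced along these lines. The sketch reuses the ten residual sugar values from the structure listing; with so few rows there is no Sweet wine, so the empty level is dropped.

```r
# Small stand-in for the wine data: ten residual sugar values (g/dm^3).
df <- data.frame(
  residual.sugar = c(20.7, 1.6, 6.9, 8.5, 8.5, 6.9, 7, 20.7, 1.6, 1.5)
)
df$sweetness.level <- cut(df$residual.sugar,
                          breaks = c(0, 4, 12, 45, Inf),
                          labels = c("Dry", "Medium-Dry", "Medium", "Sweet"))

# No Sweet wine in this small sample, so drop the unused level.
df$sweetness.level <- droplevels(df$sweetness.level)

# One six-number summary per sweetness level.
by(df$residual.sugar, df$sweetness.level, summary)

# Largest value in each group, as a quick check against the category limits.
group.max <- sapply(split(df$residual.sugar, df$sweetness.level), max)
group.max
```

Checking each group's maximum against the category limits (4, 12 and 45 g/L) is a quick way to confirm that the boundaries were applied as intended.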
This plot shows the same log-transformation plot from above, with the sweetness levels used to colour the respective areas.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.720 3.090 3.180 3.188 3.280 3.820
The pH levels range from 2.72 to 3.82, with the median and mean almost the same, at 3.180 and 3.188 respectively. The plot shows a classic normal distribution.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.800 6.300 6.800 6.855 7.300 14.200
Fixed acidity (FA), which measures tartaric acid, follows a normal distribution and there are a couple of outliers in the right-hand tail.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.2700 0.3200 0.3342 0.3900 1.6600
Citric acid, which is added to improve “freshness”, has a normal distribution and several outliers beyond 0.75 g/dm3 in the right-hand tail, and a spike at just under 0.5 g/dm3.
According to the Code of Federal Regulations (CFR), the allowable limit for volatile acidity (VA) is 1.1000 g/L [3]. High levels of VA can lead to a vinegar taste, which is considered a wine fault; many people can detect this at levels above 600 mg/L [4], or 0.6 g/L.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0800 0.2100 0.2600 0.2782 0.3200 1.1000
The first plot for volatile acidity (VA), measuring acetic acid, has a pronounced right-skew and a long right-hand tail. Transforming the variable to log10 scale reveals it’s a log-normal distribution!
The summary shows the max VA value to be 1.1000 g/L, which falls within allowable limits.
I will add a factored variable for VA to use later for easier comparisons. Values equal to or greater than 0.6 g/L will be categorized as “high_VA”, the rest “low_VA”.
## low_VA high_VA
## 4825 73
## low_VA high_VA
## 0.985 0.015
Above, we have the summary and proportions of the volatile acidity levels.
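A sketch of how such a two-level factor could be built, using the 0.6 g/L cut-off stated above. The acidity values are toy numbers chosen to sit on both sides of the threshold; the last line recomputes the reported proportions from the reported counts.

```r
# Toy volatile acidity values in g/L (illustrative, not rows from the dataset).
volatile.acidity <- c(0.27, 0.30, 0.28, 0.60, 0.23, 1.10, 0.59)

# Values at or above 0.6 g/L are "high_VA"; everything else is "low_VA".
# Listing the levels explicitly puts low_VA first in tables and legends.
va.level <- factor(ifelse(volatile.acidity >= 0.6, "high_VA", "low_VA"),
                   levels = c("low_VA", "high_VA"))

table(va.level)
round(prop.table(table(va.level)), 3)

# Proportions for the full dataset, recomputed from the counts shown above.
reported <- round(c(low_VA = 4825, high_VA = 73) / 4898, 3)
reported
```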
## quality
## Min. :3.000
## 1st Qu.:5.000
## Median :6.000
## Mean :5.891
## 3rd Qu.:6.000
## Max. :9.000
## quality
## Min. :3.000
## 1st Qu.:4.000
## Median :5.000
## Mean :5.014
## 3rd Qu.:6.000
## Max. :8.000
Here we subset the wines by VA level and see that there are only 73 wines characterised as higher-VA, about 1.5% of the wines. From the first summary above, we see that lower-VA wines have median and mean quality scores of 6 and 5.9 respectively, while for the higher-VA wines both are about 5.0. I will look further into this later on.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2200 0.4100 0.4700 0.4898 0.5500 1.0800
The first plot for sulphates appears to have a slightly positively skewed normal distribution. With a log transform we have a log-normal distribution.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.0 108.0 134.0 138.4 167.0 440.0
Total sulfur dioxide looks to be normally distributed, but there are a number of outliers in the right-hand tail.
Sulfur dioxide is important for keeping wines fresh; however, levels of free sulfur dioxide over 50 ppm can be sensed and are undesirable.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 2.00 23.00 34.00 35.31 46.00 289.00
Similarly, the distribution for free sulfur dioxide has a very slight positive skew with several outliers in the tail. The vertical line at 50 ppm (ppm is equivalent to mg/L) indicates the concentration at which free sulfur dioxide becomes noticeable in the smell and taste of the wine. In the second plot I’ve set a limit to zoom in on the bulk of the data points.
I will add a factored variable with 50 ppm as the boundary point, so we can take a closer look at it in later sections.
## low_SO2 high_SO2
## 4030 868
## low_SO2 high_SO2
## 0.823 0.177
The summary and proportions of the high and low free sulfur dioxide levels are calculated above.
## quality
## Min. :3.000
## 1st Qu.:5.000
## Median :6.000
## Mean :5.915
## 3rd Qu.:6.000
## Max. :9.000
## quality
## Min. :3.000
## 1st Qu.:5.000
## Median :6.000
## Mean :5.705
## 3rd Qu.:6.000
## Max. :9.000
In the faceted plot above we can see there is not much difference in the distributions of wines with low SO2 levels vs. high SO2 levels. The summaries likewise show identical quartiles for low SO2 (appearing first) and high SO2 (second); only the mean is slightly lower for the high SO2 wines (5.7 vs. 5.9).
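The two quality summaries can be reproduced by subsetting on the SO2 factor. In this sketch the first ten free sulfur dioxide values come from the structure listing and the last two rows are invented so that both groups are populated. Whether a wine at exactly 50 ppm counts as high is not stated in the report; the code assumes "over 50" means strictly greater than 50.

```r
# Ten rows from the structure listing plus two invented rows (the last two).
df <- data.frame(
  free.sulfur.dioxide = c(45, 14, 30, 47, 47, 30, 30, 45, 14, 28, 55, 72),
  quality             = c(6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5, 7)
)

# Assumption: the boundary is exclusive, so exactly 50 ppm stays in low_SO2.
df$so2.level <- factor(ifelse(df$free.sulfur.dioxide > 50, "high_SO2", "low_SO2"),
                       levels = c("low_SO2", "high_SO2"))

# Quality summary for each subset, low SO2 first.
summary(subset(df, so2.level == "low_SO2", select = quality))
summary(subset(df, so2.level == "high_SO2", select = quality))

# The same comparison in one call: mean quality per level.
mean.by.level <- tapply(df$quality, df$so2.level, mean)
mean.by.level
```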
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00900 0.03600 0.04300 0.04577 0.05000 0.34600
The distribution for chlorides is normally shaped, but with a very long tail of outliers beyond 0.075 g/dm3. Zooming in we get a better look at the normal distribution.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9871 0.9917 0.9937 0.9940 0.9961 1.0390
The distribution for density is not very dispersed, but also has several outliers. This attribute is a function of several other variables, mainly residual sugar and alcohol (as will be seen in the correlation matrix below).
There are 4898 white wines in the dataset, with values for 11 attributes and 1 output variable. All attributes and the output (quality) are numeric values.
Observations based on attributes:
The most common quality ratings are 6 and then 5, accounting for 45% and 30% of the wines respectively.
Most wines in the dataset have low residual sugars: they are classified as Dry (up to 4 g/L) or Medium-dry (up to 12 g/L) wines. These wines account for 43% and 40% of the wines respectively.
The median value for alcohol (% by volume) is 10.4, and the mean is just slightly higher at 10.5. The interquartile range (IQR) is from 9.5 to 11.4.
Almost 18% of wines have a free sulfur dioxide level greater than 50 ppm.
Only about 1.5% of wines have VA levels at or above 600 mg/L (0.6 g/L).
The main features of the dataset appear to be alcohol, residual sugar and density. These attributes are highly correlated (seen below in the correlation matrix of the Bivariate Plots section), as alcohol is a byproduct of the fermentation process. I would like to see whether there are any other variables that may have an impact on these features or the quality score.
From the correlation matrix in the Bivariate Plots section we see that the strongest correlations are between density and residual sugar, and then between density and alcohol; this is as I expected (fermentation process).
There also is a moderate correlation between total sulfur dioxide (SO2) and density, as well as with alcohol. As total SO2 also contains the free SO2, and it is this free form that is detectable, I will add this to the features of interest.
I created a factorised variable for sweetness level based on the sweetness categories “Dry”, “Medium-dry”, “Medium” and “Sweet”.
I created a simple factorised variable for high/low volatile acidity (VA) levels; acetic acid can be detected in wines as a vinegary taste, and this may have an effect on the quality score.
I also created a similar factorised variable for high/low free sulfur dioxide (SO2) levels; this can also be detected as an unpleasant smell and taste. Likewise, higher levels of this attribute may affect the quality score.
The distribution for residual sugar is heavily skewed and oddly shaped, and the majority of wines have small levels.
I created a factorised version of the quality score; the original variable remains unchanged. This factor variable makes it easier to create boxplots with other variables.
I also plotted a log-transform for residual sugars; the transformed distribution is revealed to be bi-modal with a peak at around 1.1 and another at around 9.
Plotting a log-transform for volatile acidity as well as sulphates also revealed log-normal distributions.
## fixed.acidity volatile.acidity citric.acid
## fixed.acidity 1.000 -0.023 0.289
## volatile.acidity -0.023 1.000 -0.149
## citric.acid 0.289 -0.149 1.000
## residual.sugar 0.089 0.064 0.094
## chlorides 0.023 0.071 0.114
## free.sulfur.dioxide -0.049 -0.097 0.094
## total.sulfur.dioxide 0.091 0.089 0.121
## density 0.265 0.027 0.150
## pH -0.426 -0.032 -0.164
## sulphates -0.017 -0.036 0.062
## alcohol -0.121 0.068 -0.076
## quality -0.114 -0.195 -0.009
## residual.sugar chlorides free.sulfur.dioxide
## fixed.acidity 0.089 0.023 -0.049
## volatile.acidity 0.064 0.071 -0.097
## citric.acid 0.094 0.114 0.094
## residual.sugar 1.000 0.089 0.299
## chlorides 0.089 1.000 0.101
## free.sulfur.dioxide 0.299 0.101 1.000
## total.sulfur.dioxide 0.401 0.199 0.616
## density 0.839 0.257 0.294
## pH -0.194 -0.090 -0.001
## sulphates -0.027 0.017 0.059
## alcohol -0.451 -0.360 -0.250
## quality -0.098 -0.210 0.008
## total.sulfur.dioxide density pH sulphates alcohol
## fixed.acidity 0.091 0.265 -0.426 -0.017 -0.121
## volatile.acidity 0.089 0.027 -0.032 -0.036 0.068
## citric.acid 0.121 0.150 -0.164 0.062 -0.076
## residual.sugar 0.401 0.839 -0.194 -0.027 -0.451
## chlorides 0.199 0.257 -0.090 0.017 -0.360
## free.sulfur.dioxide 0.616 0.294 -0.001 0.059 -0.250
## total.sulfur.dioxide 1.000 0.530 0.002 0.135 -0.449
## density 0.530 1.000 -0.094 0.074 -0.780
## pH 0.002 -0.094 1.000 0.156 0.121
## sulphates 0.135 0.074 0.156 1.000 -0.017
## alcohol -0.449 -0.780 0.121 -0.017 1.000
## quality -0.175 -0.307 0.099 0.054 0.436
## quality
## fixed.acidity -0.114
## volatile.acidity -0.195
## citric.acid -0.009
## residual.sugar -0.098
## chlorides -0.210
## free.sulfur.dioxide 0.008
## total.sulfur.dioxide -0.175
## density -0.307
## pH 0.099
## sulphates 0.054
## alcohol 0.436
## quality 1.000
The density of a wine correlates with the key parts of the fermentation process: residual sugar and alcohol. Residual sugar is also inversely correlated with alcohol, although only moderately; this makes sense, since sugar is metabolised during fermentation to produce alcohol, and whatever sugar remains is the residual sugar. Total sulfur dioxide also correlates with density, and to lesser extents with alcohol and residual sugar.
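A matrix like the one above can be produced with `cor()` on the numeric columns and `round()`. The sketch below uses only the first five rows of three columns from the structure listing, so its coefficients will not match the full-data values; it only illustrates the mechanics and the expected signs.

```r
# First five rows of three columns, taken from the structure listing above.
wines <- data.frame(
  residual.sugar = c(20.7, 1.6, 6.9, 8.5, 8.5),
  density        = c(1.001, 0.994, 0.995, 0.996, 0.996),
  alcohol        = c(8.8, 9.5, 10.1, 9.9, 9.9)
)

# Pearson correlation for every pair of columns, rounded to 3 decimals.
cor.matrix <- round(cor(wines, method = "pearson"), 3)
cor.matrix

# Pull out single coefficients by row and column name.
cor.matrix["density", "residual.sugar"]
cor.matrix["density", "alcohol"]
```

Even in this tiny sample density rises with residual sugar and falls with alcohol, the same directions as the 0.839 and -0.780 coefficients in the full matrix.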
I want to look at the features with the highest correlations: alcohol, density, residual sugar and total sulfur dioxide.
First, I’ll create a new variable with 3 factors based on quality, so the bins are more similar in size.
poor [0-5], good [6], and excellent [7-10]
##
## poor good excellent
## 1640 2198 1060
## [,1]
## fixed.acidity -0.10315405
## volatile.acidity -0.18292444
## citric.acid -0.02030053
## residual.sugar -0.12550624
## chlorides -0.22131592
## free.sulfur.dioxide -0.01397733
## total.sulfur.dioxide -0.20118454
## density -0.33249501
## pH 0.10642160
## sulphates 0.06002660
## alcohol 0.46316492
## quality 0.94906065
Correlation coefficients for the simplified quality levels aren’t much different from the original quality scores.
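The three-level variable and its correlation with the original score can be checked from the quality counts alone. This sketch assumes the bins were built with `cut()`; the `as.numeric()` coding of the levels (poor = 1, good = 2, excellent = 3) is the one visible in the `cor.test` output later in the report.

```r
# Rebuild the vector of quality scores from the counts in the frequency table.
counts <- c("3" = 20, "4" = 163, "5" = 1457, "6" = 2198,
            "7" = 880, "8" = 175, "9" = 5)
quality <- rep(as.integer(names(counts)), times = counts)

# poor = scores up to 5, good = 6, excellent = 7 and above.
quality.level <- cut(quality,
                     breaks = c(-Inf, 5, 6, Inf),
                     labels = c("poor", "good", "excellent"))
table(quality.level)

# Correlation between the score and the numeric code of its level.
r <- cor(quality, as.numeric(quality.level))
round(r, 3)
```

The counts come out as 1640, 2198 and 1060, and the coefficient agrees with the 0.949 shown in the last row of the output above. That close agreement is why the other coefficients barely change when the simplified levels replace the raw scores.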
Both of these plots show a distinct relationship reflecting the moderate, positive correlation between alcohol and quality, with a coefficient of 0.44. The wines with higher quality scores seem to have higher levels of alcohol.
Now using the simplified quality level - poor, good and excellent - let’s see how the poor and excellent compare.
## df$quality.level: poor
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.00 9.20 9.60 9.85 10.40 13.60
## --------------------------------------------------------
## df$quality.level: good
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.50 9.60 10.50 10.58 11.40 14.00
## --------------------------------------------------------
## df$quality.level: excellent
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 8.50 10.70 11.50 11.42 12.40 14.20
The plot with the combined, simplified quality levels shows the difference in distribution between the poor and excellent quality wines. The median alcohol level for poor wines is much lower than for the excellent wines, and the 75th percentile for poor quality wines (quality 0 to 5) is even slightly below the 25th percentile for excellent quality wines; the overlap between poor and excellent is small.
When removing the 3 extreme outliers, shown above the blue line at 1.005 in the first plot, and zooming in, it seems like there is something of a relationship between quality and density. Despite the look of it, the correlation is weak, with a coefficient of just -0.31.
## df$quality.level: poor
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9872 0.9932 0.9951 0.9952 0.9971 1.0024
## --------------------------------------------------------
## df$quality.level: good
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9876 0.9917 0.9937 0.9940 0.9959 1.0390
## --------------------------------------------------------
## df$quality.level: excellent
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9871 0.9905 0.9917 0.9924 0.9936 1.0006
The violin plot, with the 25th percentile, median and 75th percentile, shows quite well how the distributions for density differ between poor and excellent quality wines.
The first plot, containing all data points, includes some extreme outliers making it difficult to see the detail. In the second point plot I’ve set an upper limit of 23 g/dm3 on residual sugar (RS) to exclude those extremes, and now we see the variation among the quality factors. The slight downward slope of the smoothing line indicates little correlation, reflecting a coefficient of -0.10.
Let’s see how the distributions for RS compare between the simplified quality level variables, poor, good and excellent.
## df$quality.level: poor
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.600 1.700 6.625 7.054 11.025 23.500
## --------------------------------------------------------
## df$quality.level: good
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.700 1.700 5.300 6.442 9.900 65.800
## --------------------------------------------------------
## df$quality.level: excellent
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.800 1.800 3.875 5.262 7.400 19.250
##
## Pearson's product-moment correlation
##
## data: as.numeric(df$quality.level) and df$residual.sugar
## t = -8.8518, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.1529750 -0.0978437
## sample estimates:
## cor
## -0.1255062
When using the simplified quality levels, there isn’t much of a difference in the correlation coefficient, which is -0.13 here. We do, however, see a difference between the levels: both the median residual.sugar level and the spread are lower for excellent quality wines than for poor wines.
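The numbers in the `cor.test` output are tied together by two standard formulas: the t statistic is r * sqrt(df / (1 - r^2)), and the confidence interval comes from Fisher's z transformation. The sketch below plugs in the reported r and degrees of freedom; the variable names are illustrative.

```r
# Values copied from the cor.test output above.
r        <- -0.1255062
df.resid <- 4896            # degrees of freedom = n - 2
n        <- df.resid + 2

# t statistic for testing whether the true correlation is zero.
t.stat <- r * sqrt(df.resid / (1 - r^2))
round(t.stat, 4)

# 95% confidence interval via Fisher's z transformation.
z  <- atanh(r)
se <- 1 / sqrt(n - 3)
ci <- tanh(z + c(-1, 1) * qnorm(0.975) * se)
round(ci, 7)
```

With nearly 4900 wines even a coefficient this small gives a large t statistic, so the tiny p-value says the correlation is very unlikely to be exactly zero, not that it is strong.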
This plot shows a very tight concentration on the left, but then an outlier that is so far removed from the rest. Is this outlier valid data? I’ll remove this one, and the next 2, from the next plots.
Subsetting the dataset eliminates the 3 outliers, and we can see the strong relation between alcohol and density, echoing the correlation coefficient of -0.78. It looks like wines with a higher density have lower alcohol levels.
With a reverse transform on the y-axis (alcohol), the points are shifted and seem to curve slightly, as does the smoothing line. I wonder what non-linear function could be used to transform the plot to a linear form?
The first plot with all data points shows a tight concentration in the lower left quadrant and a few extreme outliers. The second plot zooms in and provides a good picture of the relationship between density and residual sugar (RS): wines with higher RS also have higher levels of density. The correlation coefficient for this pair is 0.84, indicating a very strong relationship.
In the first plot we see that there is a large concentration of wines with very low levels of residual sugar (RS). Zooming in, the second plot shows only those wines with RS levels below 23 g/dm3, and there is still a tight group of wines with very low RS levels at all levels of alcohol. This makes sense, as fermentation can only continue in the presence of sugars, and different types of grapes will have differing levels of sugars (fructose and glucose) to begin with [5]. As alcohol is produced, RS will decrease. The correlation coefficient for these variables was found to be -0.45, which is moderate.
This is related to the previous plots for residual.sugar vs. alcohol, as the sweetness level is a factorised variable based on residual sugar. The boxplot shows the relationship between alcohol and sweetness level: the median value for Dry wines is the highest at about 11, and its IQR is about the same as for Medium-Dry wines. The distribution of Medium wines has a much narrower IQR and more outliers in its long tail.
The first plot shows a tight concentration of points in the lower left quadrant trending in an upward slant with a few outliers. Zooming in on this main area, there is a good sense of following a line: as total sulfur dioxide increases density also increases. A coefficient of 0.53 between density and total sulfur dioxide indicates a moderate correlation.
Zooming in on the loose mass of points in the second plot, and removing 6 outliers, there is a hint of a relationship which reflects a moderate correlation of -0.45 between alcohol and total sulfur dioxide; as levels of total sulfur dioxide increase, alcohol levels tend to decrease.
The plot shows a tight concentration of points in the lower left quadrant with a few outliers. Zooming in on this main area we see the mass of points is not structured or organized in any linear or uniform way. This reflects the essentially zero correlation of 0.06 between residual sugar and volatile acidity.
I’m going to use the simplified quality level variable here instead of the quality factors to see if there’s anything here.
## df$quality.level: poor
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 9.0 117.0 149.0 148.6 182.0 440.0
## --------------------------------------------------------
## df$quality.level: good
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 18.0 107.2 132.0 137.0 164.0 294.0
## --------------------------------------------------------
## df$quality.level: excellent
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 34.0 101.0 122.0 125.2 146.0 229.0
##
## Pearson's product-moment correlation
##
## data: as.numeric(df$quality.level) and df$total.sulfur.dioxide
## t = -14.371, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.2279069 -0.1741594
## sample estimates:
## cor
## -0.2011845
The scatter plot shows a wider spread of total sulfur dioxide (TSO2) levels for poor quality wines; the interquartile range (IQR) is 65 mg/L compared to 45 mg/L for excellent quality wines. The faint downward slope of the smoothing line suggests that any relation here is weak; the correlation coefficient for the pair is -0.20.
Volatile acidity shows similar tendencies to total sulfur dioxide: there is a vague hint of a trend, but the correlation coefficient of -0.20 is weak. The second boxplot displays the distribution of wines with higher VA: although there are few of these, the medians are all very close.
The scatter plot shows the very shallow slope, reflecting the weak correlation.
When using the simplified quality levels, poor, good and excellent, we have this plot.
## df$quality.level: poor
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.1000 0.2400 0.2900 0.3103 0.3500 1.1000
## --------------------------------------------------------
## df$quality.level: good
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0800 0.2000 0.2500 0.2606 0.3000 0.9650
## --------------------------------------------------------
## df$quality.level: excellent
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0800 0.1900 0.2500 0.2653 0.3200 0.7600
There really isn’t much difference between poor and excellent.
The boxplot shows little variation in the median across quality scores, and the wider spread for quality 6 and 5 can be explained by the fact that there are far more wines in these categories. With an upper limit set at 0.75 in the scatter plot, we can see the distributions more clearly. It doesn’t look like there is any trend here for citric acid, reflecting the near-zero correlation of -0.01 with quality.
In the first plot we see that there is a large concentration of wines with low levels of chlorides. In the second plot, a limit of 0.08 g/dm3 is set on the x-axis, and this smoothing line appears to be pulled out of alignment with the mass of points. The correlation coefficient for these variables was found to be -0.36, which is quite weak, even though the bulk of the points does suggest something of a trend.
The attribute with the highest correlation with the quality score is alcohol, and this is only a moderate correlation. Alcohol does have a strong correlation with density, and density, in turn, has an even stronger correlation with residual sugar (RS).
From the plots so far, it looks like wines with higher quality scores tend to have higher levels of alcohol. Looking just at the categories with higher counts of wines, the median level of alcohol for quality 5 wines is about 9.5 (% by volume) while it’s about 10.5 for quality 6 wines and 11.5 for quality 7 wines. Using the simplified categories gives us similar results: the medians are about 9.6 for poor and 11.5 for excellent.
Wine with higher density has higher levels of residual sugar (RS) and lower levels of alcohol. Both of these correlations with density are strong, reflecting the main processes involved in fermentation.
It is interesting that, despite only a moderate correlation between alcohol and RS (-0.45), there does appear to be a trend that as RS increases, alcohol decreases.
Even though total sulfur dioxide (TSO2) is only moderately correlated with density, there is a hint that wines having higher levels of TSO2 tend to have a higher density. This makes sense from a molecular standpoint!
The relationship between alcohol and chlorides looks interesting; one area of the plot looked quite organized, while a large number of points were scattered loosely further out.
The strongest relationships were between density and residual sugar, with a strong positive correlation of 0.84, and between density and alcohol, with a strong negative correlation of -0.78.
In these density plots we see that the highest quality wines (9) have a much higher tendency to have high levels of alcohol, and the poorest quality tend to have the lowest. In the plot with simplified quality levels (excellent, good and poor) the combined group of excellent quality wines (7-9) still tends to have higher alcohol levels, and the poor(er) quality wines (3-5) still have a greater tendency to have low alcohol levels.
Sweet wines are predominantly associated with lower alcohol levels, and Dry wines tend to have higher alcohol levels than Medium wines.
It looks like most of the wines with high levels of free sulfur dioxide (high_SO2 is above 50 ppm) tend to have lower alcohol levels. For wines with lower free SO2 levels, there is not much variance across levels of alcohol.
For volatile acidity (VA), there’s not much difference between higher-VA and lower-VA wines; both categories show similar distributions.
Here we see how the distribution for wine density differs between poor and excellent quality wines; it’s essentially the same as the violin plot seen in the previous Bivariate Plots section. Within the excellent wines group, there are proportionately more with lower densities, while the other groups exhibit more balanced proportions.
The differentiation by sweetness level is more pronounced: Dry wines have a tendency to have lower densities, while Medium sweetness wines have higher densities.
(NOTE: I’ve set an upper limit of 23 g/dm3 for residual sugar levels to remove outliers.) All quality levels have a tendency to have lower residual sugar, the excellent more so than the others. This is expected though, as about 43% of the wines in the dataset are in the Dry category.
Here we see the sweetness levels arranged in bands. The smoothing lines for each sweetness level highlight the differences: Dry wines are concentrated at lower density by alcohol, and Medium sweetness wines at higher density by alcohol level.
In the first plot I use the quality levels to get a better image of the overall arrangement. In the faceted plots the quality score is used as colour and we see the arrangement within the quality levels: poor quality wines are predominantly in the lower alcohol/higher density region, while the good quality (score 6) are evenly distributed. Most of the excellent quality wines seem to have higher alcohol and lower density.
The combined plot shows that most wines in the lower alcohol levels are Medium, while at the higher alcohol levels there seems to be an equal distribution of Dry and Medium-Dry wines. The quality score appears to increase as alcohol levels increase.
The facets show how the sweetness levels are distributed across the alcohol levels and quality scores. Dry and Medium-Dry are quite similar in variation across levels of alcohol and also across quality scores; low alcohol wines are more in the lower quality scores, and as alcohol level goes up, the wines tend to be more in the higher quality scores. However, Medium wines are predominantly at the lower end of alcohol for all quality scores.
Wines with higher densities appear to also have lower alcohol levels (darker blues), and wines in the lower densities have higher alcohol levels and also seem to score higher quality scores.
I want to look at the ratio of quality score to alcohol level, as alcohol is the most correlated with quality score.
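A sketch of how that ratio could be added to the data frame. The column name `quality.alcohol.ratio` is hypothetical (the report does not show the name it used); the alcohol and quality values are the first ten rows from the structure listing.

```r
# First ten alcohol and quality values from the structure listing above.
df <- data.frame(
  alcohol = c(8.8, 9.5, 10.1, 9.9, 9.9, 10.1, 9.6, 8.8, 9.5, 11),
  quality = rep(6L, 10)
)

# Quality points per percentage point of alcohol (hypothetical column name).
df$quality.alcohol.ratio <- df$quality / df$alcohol

summary(df$quality.alcohol.ratio)

# Within a single quality score the ratio can only fall as alcohol rises.
df[order(df$alcohol), c("alcohol", "quality.alcohol.ratio")]
```

Because quality is the numerator and alcohol the denominator, the ratio rises with the score and falls with alcohol by construction, which is worth keeping in mind when reading its correlations with those two variables.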
The boxplot shows the distributions of the quality/alcohol ratios for each quality score. The medians of the quality/alcohol ratio increase with quality score, as expected.
The density plot for the quality/alcohol ratio by quality score shows a similar distinction in variations as the boxplot; the wines with the lower scores have similar tendencies to have low quality/alcohol ratios. As quality score increases, the quality/alcohol ratio varies more, but the tendency is that the ratio is higher than the previous score.
The variation between sweetness levels is minimal.
## # A tibble: 4 x 9
## sweetness.level Count Min. Qu_1st Median Mean Qu_3rd Max. SD
## <fct> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Dry 2097 0.242 0.492 0.541 0.543 0.594 0.909 0.0772
## 2 Medium-Dry 1975 0.238 0.522 0.556 0.563 0.612 0.865 0.0722
## 3 Medium 825 0.273 0.543 0.588 0.612 0.667 0.941 0.0958
## 4 Sweet 1 0.513 0.513 0.513 0.513 0.513 0.513 NaN
The medians and IQRs for the ratio of quality to alcohol don’t seem to vary a great deal between sweetness levels, although the Dry wines do have the lowest median. Medium-Dry wines have the least variation, while Medium wines have the largest.
## [,1]
## fixed.acidity -0.017260785
## volatile.acidity -0.260100374
## citric.acid 0.054551345
## residual.sugar 0.277082877
## chlorides 0.069552184
## free.sulfur.dioxide 0.205198540
## total.sulfur.dioxide 0.176955861
## density 0.314650471
## pH -0.009481407
## sulphates 0.064787558
## alcohol -0.346440167
## quality 0.686868666
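Standard deviation is not calculated for Sweet wines as there is only 1. The quality/alcohol ratio correlates only weakly with the chemical features of the dataset: the largest coefficients are -0.35 with alcohol and 0.31 with density. Its correlation of 0.69 with quality is expected, since quality is the numerator of the ratio.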
This plot shows distinct, slightly overlapping bands for the quality scores. For any level of density, the pattern of quality scores seems to be regular: higher quality wines have higher quality/alcohol ratios and predominantly lower density levels. Wines with the lowest quality/alcohol ratios also have the lowest quality scores.
For excellent wines, as the quality/alcohol ratio increases, the density also tends to increase. Poor quality wines generally have lower quality/alcohol ratios and slightly higher densities.
Even though volatile acidity (VA) is not correlated with alcohol, there is a difference in composition between the poor and excellent wines. This is evident when we facet by quality level to see each distribution: poor wines cluster more in the lower-alcohol region, and proportionally more of them have higher VA levels (above 0.6 g/L). The excellent quality wines are clustered along the lower VA levels and more at the higher alcohol levels.
The density distributions for total sulfur dioxide (TSO2) indicate that excellent quality wines have a stronger tendency to have lower levels of TSO2 than poor quality wines.
Similarly, Dry wines tend to have lower TSO2 levels than Medium sweet wines.
Density and total sulfur dioxide (TSO2) are moderately correlated with a coefficient of 0.53, so I want to look at how they relate.
NOTE: Dashed lines indicate medians.
Here we see an elongated clustering suggesting a linear relationship, although there’s still a large variation in TSO2 among wines of the same density. We also see some differentiation between the sweetness levels (colour), echoing the density plots from earlier.
The faceted plots show some variation in the distributions of the sweetness levels within the facets: poor wines rated as Medium sweet tend to be in the lower right where density is higher and TSO2 is lower. Dry wines that are rated as excellent seem to be banded at slightly higher TSO2 levels.
There is some differentiation between the alcohol levels (colour), echoing the density plots from earlier.
If we ignore the single Sweet wine, as it could be considered an outlier (and is the only one in its group), we see that Dry wines have the lowest median quality/alcohol ratio (quality per alcohol % by volume). This makes sense, as Dry wines tend to have higher alcohol levels than the other sweetness levels, and poor wines have a far greater tendency to have low alcohol levels. The spread for Medium-Dry wines is similar to that for Dry wines, but it is greater for Medium wines, where both the standard deviation and the IQR are larger.
The quality vs. alcohol plots show that for Dry and Medium-Dry wines the quality level increases at the same time as alcohol levels increase. This pattern does not hold for Medium wines; most of them are in the lower alcohol levels, and not many in the higher quality range.
When we compare the characteristics of wines of a specific alcohol level (holding alcohol level constant), wines with lower density have higher quality scores.
The density vs. alcohol levels showed distinct bands between sweetness levels. This makes sense as there is a high correlation between density and residual sugar, on which the sweetness levels are based.
At higher levels of volatile acidity, there is a separation in wine scores; holding VA constant (for example at 0.4), wines with higher alcohol levels have higher quality scores.
The distribution of alcohol (% by volume, or ABV) levels in the white wines has a very exaggerated positive skew, with the median at 10.4, in the lower half of the range.
White wines with the highest quality score have the highest median ABV, and the median increases as quality score increases from 5 and up. The variances for the lower quality ratings are similar to each other.
For wines with lower scores, there is a similar tendency to have low quality/alcohol ratios. The highest quality score has a much greater tendency to have the highest quality/alcohol ratio.
There are distinct, slightly overlapping bands for the quality scores. Holding density constant, the quality score increases as quality/alcohol ratio increases. The bands of quality scores (colours) seem to separate more as density increases.
This dataset contains averaged sample data of eleven chemical properties from almost 5000 white wines. I started first by looking at each variable to see how it varied across the wines, and I created a new variable for sweetness levels according to EU regulation 753/2002 6, to use for classifying wines as a factorised version of residual sugar. I also added two simple high/low variables for free sulfur dioxide and volatile acidity to see whether there was any effect in quality, as these features are mentioned on many sites as being the cause of faulty wines if they appear in too high amounts.
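A factorised version of residual sugar could be built with `cut()`, as in the sketch below. The cut points of 4, 12 and 45 g/L are an assumption based on the regulation's basic sugar limits and ignore its acidity-dependent allowances; the boundaries used in the original analysis are not stated and may differ. The level labels follow the summary table earlier in the report, and the input is the first ten residual sugar values from the structure output.

```r
# Residual sugar (g/L) for the first ten wines, from the str() output above
residual.sugar <- c(20.7, 1.6, 6.9, 8.5, 8.5, 6.9, 7, 20.7, 1.6, 1.5)

# Assumed cut points in g/L; the regulation's acidity-dependent
# allowances are ignored in this sketch
sweetness.level <- cut(
  residual.sugar,
  breaks = c(0, 4, 12, 45, Inf),
  labels = c("Dry", "Medium-Dry", "Medium", "Sweet"),
  include.lowest = TRUE
)

# Count of wines per sweetness level (levels with no wines are kept)
table(sweetness.level)
```

Keeping the result as a factor with all four levels means that empty or near-empty groups, such as Sweet, still show up in tables and plot legends.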
After looking at several interactions I found that the most interesting, and most distinct, involved the properties for alcohol, density, and residual sugar - in some cases using the sweetness level instead for comparison. Total sulfur dioxide did seem to have some interesting interactions with density and sweetness levels, but the correlation was moderate. I was surprised that I didn’t find much for free sulfur dioxide or volatile acidity.
The correlation summary for the entire dataset indicated that there weren’t any strong correlations with the quality score; the highest was with alcohol, and that was moderate at best. There were, however, a few strong correlations between other variables: the correlation between alcohol and density was high, and that between density and residual sugar was even stronger. There was also a moderate correlation between density and total sulfur dioxide. In hindsight, I wonder if it would have been good to use a subset of the dataset and remove any outliers for the variables. I also wonder if the outcome would be different with a larger dataset, or one with a more balanced number of quality scores or type of wine (sweetness).
https://www.r-bloggers.com/pie-charts-in-ggplot2/
https://ggplot2.tidyverse.org/reference/coord_polar.html
P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis.
Modeling wine preferences by data mining from physicochemical properties.
In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.
Available at:
[@Elsevier] http://dx.doi.org/10.1016/j.dss.2009.05.016
[Pre-press (pdf)] http://www3.dsi.uminho.pt/pcortez/winequality09.pdf
[bib] http://www3.dsi.uminho.pt/pcortez/dss09.bib